Numeric optimization#
Team members:#
Project manager - Adema
Technical writer - Daulet
Author of executable content - Meiram Sopy
Designer of interactive plots - Amina
Designer of quizzes - Adilzhan
Link to the book here
To find the optimal weights of the logistic regression, we can use the gradient descent algorithm. To apply it, one needs to calculate the gradient of the loss function.
Gradient Descent (GD)#
The objective is to find the weights \((w)\) that minimize the logistic loss function, given by:
$\( \mathcal L = -\frac{1}{N} \sum_{i=1}^{N} \big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \big) \)$
Here, \( N \) represents the number of samples, \( y_i \) is the true label of the \( i \)-th sample (0 or 1), and \( p_i \) is the predicted probability of the \( i \)-th sample belonging to class 1.
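A quick numerical check of this loss (a small sketch; the labels and probabilities below are made up for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # true labels y_i
p = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities p_i

# Mean logistic loss over N = 4 samples
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(round(loss, 4))  # → 0.299
```

Note how a confident correct prediction (0.9 for a positive sample) contributes little to the loss, while a less confident one (0.6) contributes more.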
Calculating the Gradient#
The key step in gradient descent is to compute the gradient of the loss function with respect to each weight \( w_j \). For logistic regression, the gradient is given by:
$\( \frac{\partial \mathcal L}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)\, x_{ij} \)$
This gradient reflects how the loss changes with respect to each weight, guiding the optimization process.
Update Rule#
The weights are then updated using the following rule:
$\( w_j \leftarrow w_j - \alpha \frac{\partial \mathcal L}{\partial w_j} \)$
Here, \( \alpha \) is the learning rate, determining the size of the step taken in each iteration.
Example#
Let’s consider a simple example with binary classification. Suppose we have a dataset with one feature (x) and binary labels (y). The logistic regression model predicts the probability of class 1 using the sigmoid function:
$\( p = \sigma(w_0 + w_1 x) = \frac{1}{1 + e^{-(w_0 + w_1 x)}} \)$
The logistic loss for a single sample is:
$\( \ell = -\big( y \log p + (1 - y) \log (1 - p) \big) \)$
By applying gradient descent, we iteratively update \( w_0 \) and \( w_1 \) to minimize the overall loss across all samples, converging to optimal weights for the logistic regression model.
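The toy setup above (one feature \( x \), weights \( w_0 \) and \( w_1 \)) can be sketched as follows; the data values are illustrative, not from the text:

```python
import numpy as np

# Illustrative 1-D data: feature x and binary labels y
x = np.array([0.5, 1.5, 2.0, 3.0, 3.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

w0, w1 = 0.0, 0.0   # initial weights
alpha = 0.1         # learning rate

for _ in range(1000):
    p = 1 / (1 + np.exp(-(w0 + w1 * x)))   # sigmoid predictions
    # Gradients of the mean logistic loss with respect to w0 and w1
    grad0 = np.mean(p - y)
    grad1 = np.mean((p - y) * x)
    w0 -= alpha * grad0
    w1 -= alpha * grad1

# w1 ends up positive: larger x raises the predicted probability of class 1
print(w0, w1)
```

Each iteration moves both weights against their gradients; since larger \( x \) corresponds to label 1 here, \( w_1 \) is pushed positive.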
Binary logistic regression#
Multiply the loss function (25) by \(n\):
$\( n \mathcal L(\boldsymbol w) = -\sum_{i=1}^{n} \big( y_i \log \sigma(\boldsymbol x_i^\top \boldsymbol w) + (1 - y_i) \log\big(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)\big) \big) \)$
To find \(\nabla \mathcal L(\boldsymbol w)\) observe that
$\( \frac{d}{dt} \log \sigma(t) = 1 - \sigma(t). \)$
Also,
$\( \frac{d}{dt} \log\big(1 - \sigma(t)\big) = -\sigma(t). \)$
Q. What is \(\nabla(\boldsymbol x_i^\top \boldsymbol w)\)?
Putting it all together, we get
$\( \nabla \mathcal L(\boldsymbol w) = \sum_{i=1}^n \big( \sigma(\boldsymbol x_i^\top \boldsymbol w) - y_i \big)\, \boldsymbol x_i \)$
Question
How to write \(\nabla \mathcal L(\boldsymbol w)\) as a product of a matrix and a vector, avoiding the explicit summation?
Hint
The shape of \(\nabla \mathcal L(\boldsymbol w)\) is the same as that of \(\boldsymbol w\), i.e., \(d\times 1\). Now observe that
$\( \sigma(\boldsymbol X \boldsymbol w) - \boldsymbol y \)$
is a vector of shape \(n \times 1\).
What should we multiply by this vector to obtain \(\nabla \mathcal L\)?
Question
What is the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\)?
Answer
$\( \nabla^2 \mathcal L(\boldsymbol w) = \boldsymbol X^\top \boldsymbol S \boldsymbol X, \)$
where
$\( \boldsymbol S = \mathrm{diag}\big\{\sigma(\boldsymbol x_i^\top \boldsymbol w)\big(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)\big)\big\}_{i=1}^n. \)$
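The matrix-vector form of the gradient can be verified numerically; below is a small sketch with random data (regularization omitted, all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))          # n samples, d features
y = rng.integers(0, 2, size=n)       # binary labels
w = rng.normal(size=d)

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Explicit summation: sum_i (sigma(x_i^T w) - y_i) x_i
grad_loop = sum((sigmoid(X[i] @ w) - y[i]) * X[i] for i in range(n))

# Matrix-vector form: X^T (sigma(X w) - y)
grad_vec = X.T @ (sigmoid(X @ w) - y)

print(np.allclose(grad_loop, grad_vec))  # → True
```

The vectorized form avoids the Python-level loop over samples, which is why the implementations below use it.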
Breast cancer dataset: numeric optimization#
Fetch the dataset:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
X.shape, y.shape
((569, 30), (569,))
Apply the gradient descent algorithm to logistic regression:
import numpy as np
from scipy.special import expit
def logistic_regression_gd(X, y, C=1, learning_rate=0.01, tol=1e-3, max_iter=10000):
    w = np.random.normal(size=X.shape[1])
    gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    for i in range(max_iter):
        if np.linalg.norm(gradient) <= tol:
            return w
        w -= learning_rate * gradient
        gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    print("max_iter exceeded")
    return w
Fit the logistic regression on the whole dataset:
%time w = logistic_regression_gd(X, y, learning_rate=2e-7, max_iter=10**6)
w
max_iter exceeded
CPU times: user 8min 15s, sys: 14 s, total: 8min 29s
Wall time: 1min 11s
array([ 3.35199538, 0.12469049, 0.59960879, -0.00612838, -0.42058358,
0.05986231, -1.63588522, -0.19684143, -0.10281996, -0.15783997,
0.6166746 , 0.23132946, 0.4444697 , -0.20410911, 0.27713734,
0.40239203, -1.21511884, -0.36806714, 0.05320817, -0.51422639,
1.12859869, -0.50453899, -0.53031481, -0.05534487, 0.39546217,
-0.46421608, -0.91304783, -1.18284503, 0.77318941, 1.69515415])
Calculate the accuracy score:
from sklearn.metrics import accuracy_score
accuracy_score(expit(X.dot(w)) > 0.5, y)
0.9525483304042179
Compare with sklearn:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(fit_intercept=False, max_iter=5000)
log_reg.fit(X, y)
print(log_reg.score(X, y))
print(accuracy_score(log_reg.predict(X), y))
log_reg.coef_
0.9595782073813708
0.9595782073813708
array([[ 2.1975179 , 0.11135549, -0.07305484, -0.00335447, -0.16745265,
-0.41425964, -0.67991182, -0.36939821, -0.2430434 , -0.02410403,
-0.02336256, 1.20717466, 0.04631705, -0.09649416, -0.01903567,
0.01810195, -0.03747183, -0.04324005, -0.04367351, 0.00878879,
1.29284173, -0.33773315, -0.12388807, -0.02474166, -0.31037867,
-1.14588466, -1.63908301, -0.70741377, -0.73405013, -0.11303725]])
np.linalg.norm(w - log_reg.coef_)
3.751875389247443
Multinomial logistic regression#
Recall that the loss function in this case is
$\( \mathcal L = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log p_{ik} \)$
One can show that
$\( \nabla_{\boldsymbol w_k} \mathcal L = \sum_{i=1}^n (p_{ik} - y_{ik})\, \boldsymbol x_i \)$
Gradient Descent for Multinomial Logistic Regression#
In the context of multinomial logistic regression, the objective is to minimize the negative log-likelihood function.
Gradient Descent Update Rule: $\( \text{New Parameter} = \text{Old Parameter} - \text{Learning Rate} \times \text{Gradient} \)$
Example#
Let’s consider a multinomial logistic regression problem with three categories (classes: A, B, C) and two features (variables: \( X_1, X_2 \)).
Softmax function: $\( P(Y=k|X) = \frac{e^{X \cdot \beta_k}}{\sum_{i=1}^{K} e^{X \cdot \beta_i}} \)$
Negative Log-likelihood: $\( \text{Negative Log-Likelihood} = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k} \log\left(\frac{e^{X_i \cdot \beta_k}}{\sum_{j=1}^{K} e^{X_i \cdot \beta_j}}\right) \)$
Gradient: $\( \nabla \text{Negative Log-Likelihood} = X^T \cdot (P - Y) \)$
Update rule (applied per class \( k \), using the \( k \)-th column of \( P - Y \)): $\( \beta_k = \beta_k - \text{Learning Rate} \times \left(X^T \cdot (P - Y)\right)_k \)$
Where:
\( N \) is the number of observations,
\( K \) is the number of categories,
\( X_i \) is the feature vector for observation \( i \),
\( y_{i,k} \) is the indicator variable (1 if observation \( i \) is in category \( k \), 0 otherwise)
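The formulas above can be sketched end-to-end in NumPy; the data here is synthetic and every array name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 60, 2, 3                       # observations, features, categories
X = rng.normal(size=(N, d))
labels = rng.integers(0, K, size=N)
Y = np.eye(K)[labels]                    # indicator matrix y_{i,k}

beta = np.zeros((d, K))                  # one coefficient column per category
lr = 0.1                                 # learning rate

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True) # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

for _ in range(200):
    P = softmax(X @ beta)                # N x K predicted probabilities
    grad = X.T @ (P - Y)                 # gradient of the negative log-likelihood
    beta -= lr * grad / N                # gradient descent update

print(P.shape)                           # → (60, 3); each row sums to 1
```

Because `X.T @ (P - Y)` produces the whole \( d \times K \) gradient at once, all class columns \( \beta_k \) are updated in a single matrix operation.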
Newton’s Method#
To apply Newton’s method to find the optimal weights of logistic regression, we need: a dataset, the logistic regression model, the sigmoid function, the objective (loss) function, the gradient and the Hessian matrix of the loss, and a convergence criterion.
Logistic regression model#
The logistic regression model predicts the probability of a binary outcome, represented as:
$\( P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) \)$
Here, \( \mathbf{w} \) is the vector of weights, and \( \mathbf{x} \) is the vector of features.
Sigmoid function#
Define a sigmoid function to transform the linear combination of weights and features into probabilities:
$\( \sigma(t) = \frac{1}{1 + e^{-t}} \)$
Apply Newton’s method for logistic regression:
import numpy as np
from scipy.special import expit
def logistic_regression_newton(X, y, C=1, tol=1e-3, max_iter=10000):
    nw = np.random.normal(size=X.shape[1])
    for i in range(max_iter):
        # Compute predicted probabilities
        predictions = expit(X.dot(nw))
        # Compute gradient
        gradient = X.T.dot(predictions - y) + C * nw
        # Compute Hessian matrix
        hessian = X.T.dot(np.diag(predictions * (1 - predictions))).dot(X) + C * np.eye(X.shape[1])
        # Update weights using Newton's method
        nw -= np.linalg.inv(hessian).dot(gradient)
        # Check for convergence
        if np.linalg.norm(gradient) <= tol:
            return nw
    print("max_iter exceeded")
    return nw
Fit the logistic regression on the whole dataset:
nw = logistic_regression_newton(X, y, max_iter=10**5)
nw
max_iter exceeded
array([4.33630900e+03, 6.39557000e+03, 2.78729200e+04, 1.65216100e+05,
3.30145200e+01, 2.85902100e+01, 1.64425707e+01, 9.18111400e+00,
6.21844000e+01, 2.24436600e+01, 1.01417400e+02, 4.35675700e+02,
7.14114700e+02, 7.54524800e+03, 2.56893700e+00, 7.65345400e+00,
9.28083460e+00, 3.51918200e+00, 7.34841900e+00, 1.29807030e+00,
4.77658900e+03, 8.39488000e+03, 3.10611200e+04, 1.99527100e+05,
4.46105400e+01, 6.52141000e+01, 5.93468670e+01, 2.65766310e+01,
9.64778000e+01, 2.83608200e+01])
Calculate the accuracy score:
from sklearn.metrics import accuracy_score
accuracy_score(expit(X.dot(nw)) > 0.5, y)
0.6274165202108963
Newton’s Method for Multinomial Logistic Regression#
Let’s take the same example from Gradient Descent: $\( P(Y=k|X) = \frac{e^{X \cdot \beta_k}}{\sum_{i=1}^{K} e^{X \cdot \beta_i}} \)$
Hessian Matrix: $\( \text{Hessian} = X^T \cdot W \cdot X \)$
Gradient: $\( \nabla \text{Negative Log-Likelihood} = X^T \cdot (P - Y) \)$
Update Rule (applied per class \( k \), using the \( k \)-th column of \( P - Y \)): $\( \beta_k = \beta_k - \left(X^T \cdot W \cdot X\right)^{-1} \cdot \left(X^T \cdot (P - Y)\right)_k \)$
Where:
\( P \) is the matrix of predicted probabilities,
\( Y \) is the matrix of actual category indicators,
\( W \) is a diagonal matrix with elements \( P_k \cdot (1 - P_k) \), where \( P_k \) is the predicted probability for category \( k \).
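Following the update rule above (each class \( k \) takes its own Newton step with its diagonal \( W \), ignoring cross-class curvature), a minimal sketch on synthetic data might look like this; all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 90, 2, 3
X = rng.normal(size=(N, d))
labels = rng.integers(0, K, size=N)
Y = np.eye(K)[labels]                    # actual category indicators
beta = np.zeros((d, K))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True) # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

for _ in range(5):
    P = softmax(X @ beta)                # predicted probabilities
    for k in range(K):
        w_k = P[:, k] * (1 - P[:, k])    # diagonal of W for category k
        grad = X.T @ (P[:, k] - Y[:, k]) # k-th column of X^T (P - Y)
        H = (X.T * w_k) @ X              # X^T W X without forming diag(W)
        beta[:, k] -= np.linalg.solve(H, grad)   # one Newton step

print(np.isfinite(beta).all())           # → True
```

Writing `(X.T * w_k) @ X` scales the columns of \( X^\top \) by the diagonal of \( W \), avoiding the explicit \( N \times N \) diagonal matrix.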
Breast cancer#
Gradient Descent#
Install plotly and ipywidgets:
#pip install plotly
#pip install ipywidgets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
class LogisticRegressionGD:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent
        for _ in range(self.n_iterations):
            model = np.dot(X, self.weights) + self.bias
            y_predicted = 1 / (1 + np.exp(-model))
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = 1 / (1 + np.exp(-linear_model))
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return y_predicted_cls
# Create and train the model
model = LogisticRegressionGD()
model.fit(X_train, y_train)
# Predictions and Evaluation
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
Accuracy: 0.9883040935672515
Visualize the Data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Convert to a DataFrame for easier visualization
df = pd.DataFrame(data=X, columns=data.feature_names)
df['target'] = y
# Plotting distributions of the first two features
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df[df['target'] == 0][df.columns[0]], color='red', label='Malignant')
sns.histplot(df[df['target'] == 1][df.columns[0]], color='blue', label='Benign')
plt.title('Distribution of ' + df.columns[0])
plt.legend()
plt.subplot(1, 2, 2)
sns.histplot(df[df['target'] == 0][df.columns[1]], color='red', label='Malignant')
sns.histplot(df[df['target'] == 1][df.columns[1]], color='blue', label='Benign')
plt.title('Distribution of ' + df.columns[1])
plt.legend()
plt.tight_layout()
plt.show()
# Continue with the model training and evaluation process
# ...
# After training the model and making predictions:
# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.title('Confusion Matrix')
plt.show()
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, predictions)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Newton’s method#
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
class LogisticRegressionNewtonMethod:
    def __init__(self, n_iterations=10):
        self.n_iterations = n_iterations
        self.weights = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        for _ in range(self.n_iterations):
            scores = np.dot(X, self.weights)
            predictions = 1 / (1 + np.exp(-scores))
            # Gradient and Hessian
            gradient = np.dot(X.T, predictions - y) / n_samples
            H = np.dot(X.T * predictions * (1 - predictions), X) / n_samples
            # Update weights
            self.weights -= np.linalg.solve(H, gradient)

    def predict(self, X):
        scores = np.dot(X, self.weights)
        predictions = 1 / (1 + np.exp(-scores))
        return [1 if i > 0.5 else 0 for i in predictions]
# Create and train the model
model = LogisticRegressionNewtonMethod()
model.fit(X_train, y_train)
# Predictions and Evaluation
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
Accuracy: 0.9239766081871345
Visualize the Data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Convert to DataFrame for easier plotting
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
# Histograms for each feature in the dataset
df.drop('target', axis=1).hist(bins=15, figsize=(15, 10))
plt.suptitle('Feature Distributions')
plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(df.corr(), annot=False, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# Logistic Regression with Newton's Method
# ... [Model implementation code here]
# Create and train the model
model = LogisticRegressionNewtonMethod()
model.fit(X_train, y_train)
# Predictions and Evaluation
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="d")
plt.title('Confusion Matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
Accuracy: 0.9239766081871345
In this example, the code uses the breast cancer dataset and performs logistic regression optimized by Newton’s method.
import numpy as np
import pandas as pd
from sklearn import datasets
import plotly.graph_objects as go
import plotly.express as px
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
#loading breast cancer dataset
cancer = load_breast_cancer()
cancer_df = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']],
                         columns=np.concatenate([cancer['feature_names'], ['target']]))
#for visualization
features = ['mean radius', 'mean texture']
cancer_df['target'] = np.where(cancer_df['target'] == 0, 0, 1)
#data prep
X = cancer_df[features]
y = cancer_df['target']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
#logistic regression model
model = LogisticRegression(solver='newton-cg')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
# plot decision boundary
def plot_decisionboundary(X, y, features):
    fig = go.Figure()
    for threshold in np.linspace(0.1, 0.9, 9):
        model = LogisticRegression()
        model.fit(X, y > threshold)
        y_pred = model.predict(X)
        trace = go.Scatter(x=X[:, 0], y=X[:, 1], mode='markers',
                           marker=dict(color=y_pred, colorscale='Viridis', size=10,
                                       line=dict(color='black', width=0.5)),
                           name=f'Threshold={threshold:.1f}')
        fig.add_trace(trace)
    fig.update_layout(title='Logistic Regression Decision Boundary',
                      xaxis=dict(title=features[0]),
                      yaxis=dict(title=features[1]),
                      sliders=[dict(steps=[dict(method='relayout',
                                                args=['shapes', []])],
                                    active=0,
                                    visible=True,
                                    x=0.1,
                                    y=0,
                                    len=0.9)])
    fig.show()

plot_decisionboundary(X_scaled, y, features)
# visualize confusion matrix
def plot_confusionmatrix(X, y):
    fig = go.Figure()
    for threshold in np.linspace(0.1, 0.9, 9):
        model = LogisticRegression()
        model.fit(X, y > threshold)
        y_pred = model.predict(X)
        conf_matrix = confusion_matrix(y > threshold, y_pred)
        trace = go.Heatmap(z=conf_matrix, zmin=0, zmax=len(X), colorscale='Viridis',
                           x=['Predicted 0', 'Predicted 1'], y=['Actual 0', 'Actual 1'],
                           name=f'Threshold={threshold:.1f}')
        fig.add_trace(trace)
    fig.update_layout(title='Confusion Matrix',
                      xaxis=dict(title='Predicted Label'),
                      yaxis=dict(title='Actual Label'),
                      sliders=[dict(steps=[dict(method='relayout',
                                                args=['shapes', []])],
                                    active=0,
                                    visible=True,
                                    x=0.1,
                                    y=0,
                                    len=0.9)])
    fig.show()

plot_confusionmatrix(X_scaled, y)
Accuracy: 90.35%
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
import numpy as np
def plot_roc_curve(X, y, threshold):
    fig = go.Figure()
    model = LogisticRegression()
    model.fit(X, y > threshold)
    y_pred_prob = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y, y_pred_prob)
    auc_value = auc(fpr, tpr)
    trace = go.Scatter(x=fpr, y=tpr, mode='lines',
                       name=f'Threshold={threshold:.1f}, AUC={auc_value:.2f}')
    fig.add_trace(trace)
    fig.update_layout(title='ROC Curve',
                      xaxis=dict(title='False Positive Rate'),
                      yaxis=dict(title='True Positive Rate'))
    fig.show()
# Set your data X and y here
# X = ...
# y = ...
# Set the threshold value
threshold_value = 0.5
# Plot ROC curve with the specified threshold
plot_roc_curve(X, y, threshold_value)